ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision